Air pollutant prediction model based on transfer learning two-stage attention mechanism

Atmospheric pollution significantly impacts the regional economy and human health, and its prediction has been increasingly emphasized. The performance of traditional prediction methods is limited due to the lack of historical data support in new atmospheric monitoring sites. Therefore, this paper proposes a two-stage attention mechanism model based on transfer learning (TL-AdaBiGRU). First, the first stage of the model utilizes a temporal distribution characterization algorithm to segment the air pollutant sequences into periods. It introduces a temporal attention mechanism to assign self-learning weights to the period segments in order to filter out essential period features. Then, in the second stage of the model, a multi-head external attention mechanism is introduced to mine the network's hidden layer key features. Finally, the adequate knowledge learned by the model at the source domain site is migrated to the new site to improve the prediction capability of the new site. The results show that (1) the model is modeled from the data distribution perspective, and the critical information within the sequence of periodic segments is mined in depth. (2) The model employs a unique two-stage attention mechanism to capture complex nonlinear relationships in air pollutant data. (3) Compared with the existing models, the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) of the model decreased by 14%, 13%, and 4%, respectively, and the prediction accuracy was greatly improved.

1. This paper takes into account the strong periodicity, continuity, and non-stationarity of air pollutant concentration data, and uses the TDC algorithm to segment the sequence and learn the characteristics between the periods.
2. To better mine the latent information of the input data and capture its complex features, a temporal attention mechanism and a multi-head external attention mechanism are embedded in the temporal distribution matching layer. The temporal attention layer determines the importance of different periods and assigns the corresponding weights to obtain a better model input. To dig deeper into the critical information in the hidden layer of BiGRU and extract the temporal characteristics between different units, a multi-head external attention layer is embedded after the BiGRU layer; it captures the temporal dependence between units and assigns different attention to the important information in the hidden layer, thereby learning the critical information inside the model.
3. The BiGRU model incorporating the two-stage attention mechanism is combined with transfer learning: source domain data selected by the Multiple Kernel Maximum Mean Discrepancy (MK-MMD) are used to pre-train the model and determine the optimal network parameters. In the transfer phase, target domain data are used to fine-tune the pre-trained model to further improve its generalization ability. A comparative analysis of prediction performance on sites lacking historical data shows that the TL-AdaBiGRU model outperforms the Transformer, AdaBiGRU, BiGRU, GRU, and LightGBM models.

Air pollutants prediction approach

Air pollutants prediction framework
The air pollutant concentration prediction framework proposed in this paper is shown in Fig. 1. It is divided into a pre-training stage and a transfer stage. In the pre-training stage, anomaly detection is first applied to the pollutant concentration data and the meteorological data; the detected anomalies are marked as missing values, the missing data are filled in by linear interpolation, and the data are then normalized. Secondly, the preprocessed data are fed into the temporal distribution characterization (TDC) layer. The design of the TDC is inspired by the principle of maximum entropy: it divides the time series uniformly into ten parts and uses a greedy strategy to choose the length $n_j$ of each period, thus dividing the data into $K$ periods with large distribution gaps. This design aims to reduce the effect of data periodicity and helps the model better learn the internal information of each time period. Next, the first-stage attention mechanism, the temporal attention mechanism, assigns a weight $a_i$ to each temporal data $x_i$ according to its importance, so that full attention is paid to the feature information in the time-series data. The product $f_i$ of each temporal data $x_i$ and its attention weight $a_i$ is then used as the input to the BiGRU network. The hidden layer of BiGRU can efficiently capture the long-term dependencies of the sequence data and effectively fuse forward and backward information to generate more comprehensive and accurate feature representations. A second-stage attention mechanism, the multi-head external attention mechanism, is embedded after the hidden layer of BiGRU to dig deeper into the key features of the network's hidden layer. The multi-head external attention mechanism consists of two independent memory units, $M_k$ and $M_v$, which are used as keys and values, respectively. They can learn additional data features and prior knowledge to assist the model in feature selection and weighting, quickly filtering out the key features among numerous inputs. Finally, the source domain pre-training is completed using the fully connected layer.
In the transfer stage, the parameters from the pre-training stage serve as the basis for transfer learning with a fine-tuning strategy, which yields the new AdaBiGRU model. The new AdaBiGRU model contains the pre-trained AdaBiGRU layers of the source domain and the frozen AdaBiGRU layers (without weight update). Finally, we fine-tune the AdaBiGRU model using the preprocessed target site data to optimize the remaining parameters. We then apply the optimal TL-AdaBiGRU model to predict air pollutant concentrations at the target site and output the final prediction results.

Two-stage attention mechanisms neural networks
We propose AdaBiGRU, which consists mainly of a temporal distribution characterization (TDC) module and a temporal distribution matching (TDM) module. The TDC module quantifies the successive data distributions in a sequence and splits the sequence into the $K$ segments whose distributions are least similar. The TDM module constructs a model with temporal invariance for these $K$ segments.
The details are given below.

Temporal distribution characterization
Atmospheric pollutant concentration data are typical time series data with periodicity and non-stationarity, and the data distribution changes dynamically with time. This paper defines this problem as Temporal Covariate Shift (TCS). Consider a time series $D$ with $n$ labeled segments that can be divided into $K$ periods, that is, $D = \{D_1, D_2, \ldots, D_K\}$. TCS refers to the case in which all the segments in the same period $i$ follow the same data distribution $P_{D_i}(x, y)$, while for different periods $1 \le i \ne j \le K$, $P_{D_i}(x) \ne P_{D_j}(x)$ and $P_{D_i}(y|x) = P_{D_j}(y|x)$. As shown in Fig. 2, the data have different distributions in intervals A, B, C and D, that is, $P_A(x) \ne P_B(x) \ne P_C(x) \ne P_{Test}(x)$. In particular, the distribution of the test data also differs from that of the training data. The primary problem is therefore how to handle the differences between the data distributions while capturing the knowledge shared by the time series across periods, so that the prediction model generalizes better.
One approach in existing studies is to assume that all time series segments follow the same data distribution, but this is clearly inappropriate for air pollutant prediction. Another approach is to use adaptive algorithms to reduce the distributional differences between the data and thus learn invariant knowledge of the data domain, such as Domain Adaptation (DA) 35 and Domain Generalization (DG) 36,37. The former aims to reduce the distributional differences between the training data and the test data by learning a domain-invariant representation, while the latter aims to learn, over multiple source domains, a domain-invariant model that generalizes well to the target domain. Unfortunately, atmospheric pollutants are not only time-varying but also have a strong sequence structure, making it difficult for DA and DG methods to address the data distribution differences effectively.
In order to better represent the distribution information in the time series, this paper proposes a temporal distribution characterization (TDC) algorithm, which is described in detail in Section TDC. According to the principle of maximum entropy, the training data are partitioned into $K$ time periods with large distribution gaps to train the model; if the prediction model generalizes well across periods whose data distributions differ significantly, it should also perform well on periods whose distributions differ less. TDC achieves the time series partitioning by solving an optimization problem, formulated as Eq. (1), where $\Delta_1$, $\Delta_2$ and $K_0$ are pre-set parameters to avoid meaningless solutions. The similarity measure function $d$ is chosen to be CORAL, and the covariance distance between the distribution samples represented by CORAL is shown in Eq. (2), where $q$ is the dimension of the features and $C_S$ and $C_t$ are the covariance matrices of the two distributions.
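Because Eq. (2) did not survive in this text, the following is only a minimal sketch of a CORAL-style covariance distance. It assumes the commonly used form $d = \|C_S - C_t\|_F^2 / (4q^2)$; the function name `coral_distance` and the sample layout (rows are observations, columns are features) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def coral_distance(source: np.ndarray, target: np.ndarray) -> float:
    """CORAL-style distance between two sample sets of shape (n_samples, q).

    Assumed form: squared Frobenius norm of the covariance difference,
    scaled by 1 / (4 q^2).
    """
    q = source.shape[1]
    # rowvar=False: each column is a feature, each row an observation.
    c_s = np.atleast_2d(np.cov(source, rowvar=False))
    c_t = np.atleast_2d(np.cov(target, rowvar=False))
    return float(np.sum((c_s - c_t) ** 2) / (4.0 * q * q))
```

Two segments with identical second-order statistics give a distance of zero, and the distance grows as their covariance structures diverge, which is the property the TDC partitioning relies on.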

Temporal distribution matching
After the TDC module obtains the $K$ least similar segments, the TDM module assigns different temporal self-attention weights to the period sequences according to the importance of each period. In particular, in order to learn the temporal distribution properties and sequence correlations, AdaBiGRU adaptively matches the distributions among BiGRU units for each period using a multi-head external attention mechanism while capturing the temporal dependencies. The details are as follows.

Temporal self-attention mechanism
In deep learning, the self-attention mechanism 38 is a vital model structure used to improve the model's attention to and processing of input data. It allows the model to selectively focus on the essential parts of the input and ignore the unimportant parts, thus improving the performance and effectiveness of the model. In this paper, the temporal self-attention mechanism (TSAM) layer calculates the degree of correlation between each location of the input data and the other locations to obtain the weight of each location. With these weights, the model can focus more on task-relevant information and improve its processing power. According to Eq. (1), the period segment data $Z = \{z(t) \mid t = d, d+1, \ldots, K\}$ are used as input to the TSAM layer. The data for each period segment can be represented as $z(t) = [x^{(t,1)}, x^{(t,2)}, \ldots, x^{(t,d)}]$, with $x^{(t,i)} \in R^m$ $(i = 1, 2, \ldots, d)$, where $d$ is the length of each period, as shown in Fig. 3. The periodic data are passed through the TSAM layer to obtain a mapping relationship between time instances, as shown in Eqs. (3) and (4), where $x_i$ denotes the $i$-th temporal data, $W_i$ and $b_i$ denote the preset weights and biases corresponding to the $i$-th temporal data, $T$ is the transpose operation, $\sigma$ is the sigmoid activation function, and $a_i$ denotes the temporal attention weight corresponding to the $i$-th temporal data.
Finally, the temporal attention weight $a_i$ corresponding to each temporal data is multiplied by the corresponding sample data $x_i$ to obtain the output $f_i$ of each period sample in the temporal self-attention mechanism layer, and the output $F$ of the whole temporal self-attention mechanism layer is used as the input of the subsequent BiGRU, as shown in Eq. (5).
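Eqs. (3) to (5) are not reproduced in this text, so the sketch below only illustrates one plausible reading of the temporal attention step. It assumes that a sigmoid score is computed per time step, that the scores are normalized with a softmax to give $a_i$, and that $f_i = a_i x_i$; the shared weight vector `w`, the scalar bias `b`, and the function name are hypothetical.

```python
import numpy as np


def temporal_attention(x: np.ndarray, w: np.ndarray, b: float):
    """Weight each time step of one period segment.

    x: (d, m) segment with d time steps and m features.
    w: (m,) weight vector and b: scalar bias (assumed learnable elsewhere).
    Returns the attention weights a (d,) and the weighted segment f (d, m).
    """
    scores = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # sigmoid score per time step
    exp_scores = np.exp(scores - scores.max())     # numerically stable softmax
    a = exp_scores / exp_scores.sum()              # weights sum to 1
    f = a[:, None] * x                             # f_i = a_i * x_i
    return a, f
```

With this normalization, time steps that receive higher scores contribute proportionally more to the input passed on to BiGRU.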

Bidirectional gated recurrent neural network
The Gated Recurrent Unit (GRU) is a Recurrent Neural Network (RNN) variant for processing sequential data, designed to solve the vanishing gradient problem of traditional RNNs. Compared with a traditional RNN, the GRU has better long-term dependency modeling capability and higher computational efficiency. Its main feature is the introduction of two gating units, the reset gate and the update gate, which learn how information flows through the sequence. The reset gate controls the effect of the previous moment's hidden state on the current moment's input, whereas the update gate determines how much information from the previous moment's hidden state is retained and passed on to the next moment.
The structure of GRU is shown in Fig. 1a. The data transfer process of GRU is described by Eqs. (6) to (9), where $\sigma$ denotes the sigmoid activation function, $\tanh$ denotes the hyperbolic tangent function, $f_t$ is the input vector per unit time, and $h_t$ and $h_{t-1}$ are the outputs at times $t$ and $t-1$, respectively. $z_t$ and $r_t$ are the outputs of the update gate and reset gate, respectively, as in Eqs. (6) and (7), and $c_t$ is the candidate state, as in Eq. (8). $U_z$, $U_r$ and $U_c$ are the matrices connecting the update gate, reset gate, and candidate state to the inputs, respectively. $W_z, b_z$, $W_r, b_r$, and $W_c, b_c$ are the weights and biases of the update gate, reset gate, and candidate state, respectively. $\odot$ denotes the element-wise product.
The GRU transmits information in one direction only, from front to back. However, temporal data are strongly correlated: the state at the current moment is related to both the previous and the next moment. Therefore, for air pollutant concentration prediction, it is necessary to also study the reversed time series, and the BiGRU network is applied. BiGRU combines the hidden layer states of two separate recurrent layers, forward and backward; its basic structure is shown in Fig. 1b. Assume that the input time series has a time window of size $d$. The input to the forward GRU is $f_t$ $(t = 1, 2, \ldots, d)$; after the forward iteration, the forward output sequence of the hidden layer is shown in Eq. (10),
where $\overrightarrow{GRU}$ denotes the forward mapping of the GRU. The input sequence is fed in reverse order, $f_t$ $(t = d, d-1, \ldots, 1)$, to the backward GRU, as shown in Eq. (11),
where $\overleftarrow{GRU}$ is the mapping of the backward GRU. Combining the above equations, the output $h_t$ of the hidden layer at time $t$ is shown in Eq. (12).
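Since Eqs. (6) to (12) are missing from this text, the following NumPy sketch shows one standard way to write the GRU step and the bidirectional pass, using the symbols defined above. Several choices are assumptions rather than the authors' formulation: the update rule $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot c_t$ (some references swap the roles of $z_t$ and $1 - z_t$), the concatenation of forward and backward states, and the helper names `init_params`, `run_gru`, and `bigru`.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def init_params(n_in, n_hidden, seed=0):
    """Random illustrative parameters for the z (update), r (reset), c (candidate) parts."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "zrc":
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p[f"b_{g}"] = np.zeros(n_hidden)
    return p


def gru_step(f_t, h_prev, p):
    """One GRU step: gates, candidate state, then the new hidden state."""
    z = sigmoid(p["U_z"] @ f_t + p["W_z"] @ h_prev + p["b_z"])        # update gate
    r = sigmoid(p["U_r"] @ f_t + p["W_r"] @ h_prev + p["b_r"])        # reset gate
    c = np.tanh(p["U_c"] @ f_t + p["W_c"] @ (r * h_prev) + p["b_c"])  # candidate
    return z * h_prev + (1.0 - z) * c   # assumed convention: z keeps the old state


def run_gru(seq, p, n_hidden):
    """Run the GRU over seq of shape (d, n_in), starting from a zero state."""
    h = np.zeros(n_hidden)
    states = []
    for f_t in seq:
        h = gru_step(f_t, h, p)
        states.append(h)
    return np.stack(states)


def bigru(seq, p_fwd, p_bwd, n_hidden):
    """Forward pass plus reversed pass, re-aligned in time and concatenated."""
    fwd = run_gru(seq, p_fwd, n_hidden)
    bwd = run_gru(seq[::-1], p_bwd, n_hidden)[::-1]
    return np.concatenate([fwd, bwd], axis=1)   # shape (d, 2 * n_hidden)
```

For a window of $d$ steps, `bigru` returns one row per time step that holds both the forward and the backward hidden state, which is the kind of representation the second-stage attention operates on.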
In order to adaptively match the distribution between BiGRU units in each period while capturing the temporal dependency, a multi-head external attention mechanism is introduced to allocate enough attention to the critical information output from the hidden layer of the BiGRU network and learn the essential local information, as shown in Fig. 4. The output of the BiGRU layer is a feature matrix $F \in R^{N \times d}$, where $N$ is the number of features affecting the parameter and $d$ is the dimension of the feature. The self-attention mechanism linearly maps this input to a query matrix $Q \in R^{m \times d_k}$, a key matrix $K \in R^{m \times d_k}$, and a value matrix $V \in R^{m \times d_v}$. In practical applications, however, two different memory units $M_k$ and $M_v$ are often used as keys and values in order to increase the capacity of the network, and the single-head external attention is shown in Eq. (13),
where $M_k$ and $M_v$ are learnable parameters that function as a memory. The external attention $(a)_{i,j}$ is the similarity between the $i$-th feature and the $j$-th row of $M$. The input features are updated from the external memory unit according to the similarities in the attention matrix. Based on this single-head external attention mechanism, the multi-head external attention mechanism is obtained by computing the attention multiple times on the outputs of different BiGRU units. The $i$-th external attention is shown in Eqs. (14) and (15), where $h_i$ is the $i$-th head, $H$ denotes the number of heads, and $W$ is a linear transformation matrix designed to keep the input and output dimensions consistent. $M_k \in R^{S \times d}$ and $M_v \in R^{S \times d}$ are memory units shared by all heads for computing attention.
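Eqs. (13) to (15) are missing from this text; the sketch below shows a simplified multi-head external attention in NumPy under explicit assumptions. Each head attends over the shared memories as `softmax(F_i M_k^T) M_v` (a plain row-wise softmax is used here instead of any other normalization the paper may apply), the feature dimension is split evenly across heads so that $M_k$ and $M_v$ have shape $(S, d/H)$, and `w_o` plays the role of the linear transformation $W$. All function names are illustrative.

```python
import numpy as np


def softmax(v, axis=-1):
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def external_attention(f, m_k, m_v):
    """Single head. f: (N, d_h); m_k, m_v: (S, d_h) memory units."""
    a = softmax(f @ m_k.T, axis=-1)   # (N, S): similarity of each feature to each memory row
    return a @ m_v                    # (N, d_h): features rebuilt from the value memory


def multi_head_external_attention(f, m_k, m_v, w_o, n_heads):
    """f: (N, d) with d divisible by n_heads; the memories are shared across heads."""
    heads = np.split(f, n_heads, axis=1)                  # H slices of shape (N, d / H)
    outs = [external_attention(h, m_k, m_v) for h in heads]
    return np.concatenate(outs, axis=1) @ w_o             # back to shape (N, d)
```

Because the memories are independent of the input length, the cost of this attention grows linearly with $N$ rather than quadratically, which is the usual motivation for replacing self-attention keys and values with external memory units.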

Transfer learning
Transfer learning 39 is a learning method that applies knowledge or models learned from one task to solve another related task. Transfer learning is described in terms of domains, tasks, and marginal probabilities. A domain $D$ contains two parts, the feature space $X$ and the marginal probability distribution $P(X)$, as shown in Eq. (16).
A task $T$ also contains two parts, the label space $\mathcal{Y}$ and the objective function $f(\cdot)$, as shown in Eq. (17),
where $f(\cdot)$ is learned from the training samples $(x_i, y_i)$.
The idea of transfer learning is to improve the prediction accuracy on the target task $T_T$ in the target domain $D_T$ by utilizing the relevant knowledge learned from the source domain $D_S$ and the source task $T_S$, where $D_S \ne D_T$, $T_S \ne T_T$. The schematic diagram is shown in Fig. 5.
The primary transfer learning methods can be divided into three categories: instance transfer learning, feature transfer learning, and model transfer learning. Instance transfer learning assigns high weights to source domain samples whose data distribution is highly similar to that of the target domain, and thereby accomplishes the transfer. Feature transfer learning obtains a feature representation of the inter-domain data in a relevant feature space so that the distributions of the two domains become more similar after feature extraction, and then completes the transfer. Model parameter transfer learning, on the other hand, is more intuitive: it retains the main structure and hyper-parameters of the original model and then performs layer-specific fine-tuning of the parameters to adapt them to the target domain data, thus completing the transfer learning process.
This paper uses model parameter transfer learning, where knowledge in the source domain is shared with the target domain task. The specific process is as follows: first, freeze the last four layers of the model and train the network on the source domain data; after training for a certain number of epochs, observe the fitting effect of the model and retain the model parameters. Then, unfreeze the frozen layers, add a new fully-connected layer, and fine-tune the parameters of the fully-connected layer using the data from the target domain to obtain the final atmospheric pollutant prediction model for the target site.
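The last step of this process, fitting a new fully-connected layer on target domain data while the pre-trained layers supply features, can be illustrated with a deliberately small NumPy sketch. It is not the TL-AdaBiGRU training code: the pre-trained network is replaced by an arbitrary fixed feature function, the new layer is a single linear layer trained with plain gradient descent on the MSE loss, and the names `fine_tune_head` and `features_fn` are hypothetical.

```python
import numpy as np


def fine_tune_head(features_fn, x_target, y_target, lr=0.05, epochs=500):
    """Fit a new linear output layer on top of fixed, pre-trained features.

    features_fn stands in for the pre-trained layers; it is only evaluated,
    never updated, so the source-domain knowledge it encodes is preserved.
    """
    h = features_fn(x_target)            # forward pass through the fixed layers
    w = np.zeros(h.shape[1])             # new fully-connected layer: weights
    b = 0.0                              # new fully-connected layer: bias
    n = len(y_target)
    for _ in range(epochs):
        err = h @ w + b - y_target
        w -= lr * (2.0 / n) * (h.T @ err)   # gradient of the MSE w.r.t. w
        b -= lr * (2.0 / n) * err.sum()     # gradient of the MSE w.r.t. b
    return w, b
```

In a deep learning framework the same idea is usually expressed by disabling gradient updates for the fixed layers and passing only the remaining parameters to the optimizer.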

Description of the algorithm
In order to facilitate the design and implementation of the proposed air pollutants prediction approach, the necessary steps are summarized as Algorithm 1 in this paper.
Algorithm 1. Air pollutants prediction via TL-AdaBiGRU

Dataset description and preprocessing
Over the past few decades, Beijing has experienced rapid growth in urbanization, industrial production, and energy consumption; however, this growth has also resulted in severe air pollution problems. A large number of pollutants are emitted every year, leading to a continuous decline in atmospheric quality. In this paper, Beijing Municipality, China, was selected as the study area. The dataset was obtained from the Beijing Embassy in Foreign Countries (http://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data) and contains atmospheric quality information for 9 sites from March 2013 to February 2017. The locations of the atmospheric monitoring stations used in this paper are shown in Fig. 6.
In this study, PM10 was selected as the prediction target. To characterize the distribution of PM10, a violin plot with an embedded box plot was created for PM10 at each station, as shown in Fig. 7. The figure shows the distribution of PM10 data at each site, and a maximum value is set in the violin plot; in this paper, data greater than this maximum value are called anomalous data, and the anomalous data are recorded as missing values. For PM10 concentration series data, the inconsistency of time stamps affects the prediction accuracy. Therefore, a linear interpolation algorithm is used to fill in the missing data; data filled by linear interpolation are closer to the original data than data filled by the average interpolation method. In order to eliminate the dimensionality effect of the features and to improve the efficiency of the model, max-min normalization is used to map the data into the same range.
Atmospheric pollutants not only affect each other; temperature and barometric pressure also strongly influence them. We plotted the Spearman correlation coefficient heat map shown in Fig. 8. Temperature is negatively correlated with PM2.5, SO2, and CO and positively correlated with PM10; CO, PM2.5, SO2, and NO2 are positively correlated with barometric pressure. The dew-point temperature is positively correlated with PM2.5, PM10, NO2, and O3 and negatively correlated with SO2 and CO. Rainfall is positively correlated with PM2.5, CO and O3 and negatively correlated with SO2 and NO2. Wind speed is positively correlated with O3 and negatively correlated with the remaining five pollutants. The overall correlation between atmospheric pollutants and meteorological factors in the heat map is weak, so the meteorological factors are fed to the input layer as auxiliary model input parameters.
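The preprocessing chain described above (anomalies marked as missing, linear interpolation, max-min normalization) can be sketched as follows. The function name `preprocess` and the `upper_limit` argument, which stands for the maximum value read from the violin plot, are illustrative assumptions; the sketch also assumes that at least one valid observation remains after the anomalies are removed.

```python
import numpy as np


def preprocess(series, upper_limit):
    """Mark anomalies as missing, fill gaps linearly, then scale to [0, 1]."""
    x = np.asarray(series, dtype=float).copy()
    with np.errstate(invalid="ignore"):        # comparisons with NaN are simply False
        x[x > upper_limit] = np.nan            # anomalous values become missing values
    idx = np.arange(x.size)
    known = ~np.isnan(x)
    x = np.interp(idx, idx[known], x[known])   # linear interpolation over the gaps
    lo, hi = x.min(), x.max()
    if hi == lo:                               # constant series: avoid division by zero
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)                # max-min normalization
```

For example, `preprocess([10, 999, 30, float("nan"), 50], upper_limit=500)` treats 999 as an anomaly, fills both gaps by interpolation, and returns the values 0, 0.25, 0.5, 0.75, and 1.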

Source domain site selection
The purpose of this study is to explore the impact of transfer learning on the predictive performance at sites lacking historical data. The Dongsi monitoring site was selected as the target site, and the 6-month data from 2016/1 to 2016/7 were selected as the Dongsi site dataset. The limited historical data at the Dongsi site do not satisfy the needs of deep learning model convergence. Therefore, in addition to the general features in the pre-trained transfer model, source domain data are still needed to assist in learning the features of the target task, and the source domain monitoring sites play a crucial role in transferring meteorological and temporal knowledge to the target domain site. In this paper, we adopt the Maximum Mean Discrepancy (MMD) method to measure the similarity between the source domain monitoring sites and the target monitoring site. The MMD method can efficiently measure the scatter of first-order distributions in the Reproducing Kernel Hilbert Space (RKHS). Given datasets $A = \{a_i\}_{i=1}^{n_1}$ and $B = \{b_i\}_{i=1}^{n_2}$, the MMD of $A$ and $B$ is shown in Eq. (18),
where $\mathcal{H}$ denotes the RKHS, $\Phi(\cdot)$ is the nonlinear mapping function from the original data space to the RKHS, and $p$ and $q$ denote the probability distributions of the two datasets. MMD is further squared to obtain more precise results, as shown in Eq. (19).
The Gaussian Radial Basis Function (RBF) $k(a_i, b_j) = \exp(-\|a_i - b_j\|^2 / 2\gamma^2)$ is used, where $k\langle\cdot,\cdot\rangle$ is the kernel function. Many studies have shown that multi-kernel MMD methods can improve domain adaptation 40, and the kernel representation with $N_k$ RBF kernels is as follows,
where $k_i$ denotes the RBF kernel with bandwidth parameter $\gamma_i^2$. The MMD between the source domain site and the target site is shown in Eq. (21),
where $M$ is the total number of source domain site samples. The smaller the value of MMD, the higher the similarity with the target site; the results are shown in Table 1. The MMD values between Dongsi and the Tiantan, Shunyi, and Changping sites are 0.669, 0.668, and 0.667, respectively, and those of the Guanyuan, Huairou, and Wanliu sites are 0.674, 0.657, and 0.656, respectively; all of these values are larger than that of the Aotizhongxin site. Therefore, we selected the Aotizhongxin site as the auxiliary site for the target site, and its data for 42 months from 2013/1 to 2016/7 were selected as the source domain dataset. The descriptive statistics of the Dongsi site (target site) and the Aotizhongxin site (source domain site) are shown in Table 2.
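Eqs. (18) to (21) are not reproduced in this text. As a rough illustration of the site-similarity computation, the sketch below implements the standard biased estimate of the squared MMD with an average of several Gaussian RBF kernels. The bandwidths in `gammas`, the equal kernel weights, and the function names are assumptions and need not match the settings behind Table 1.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Gaussian RBF kernel matrix: exp(-||a_i - b_j||^2 / (2 gamma^2))."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * gamma ** 2))


def mk_mmd2(a, b, gammas=(0.5, 1.0, 2.0)):
    """Biased estimate of the squared MMD, averaged over several RBF bandwidths.

    a: (n1, q) samples of one site, b: (n2, q) samples of another site.
    """
    total = 0.0
    for g in gammas:
        total += (rbf_kernel(a, a, g).mean()
                  + rbf_kernel(b, b, g).mean()
                  - 2.0 * rbf_kernel(a, b, g).mean())
    return total / len(gammas)
```

Under this estimate, a candidate source site whose samples are distributed like those of the target site yields a value near zero, so ranking candidate sites by `mk_mmd2` against the target data selects the most similar site.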

Result

Model parameters and evaluation indicators
According to the results in Table 1, the Aotizhongxin site is used as the source domain site, and its data for 42 months from 2013/1 to 2016/7 are collected as the source domain dataset for model pre-training. 80% of the data are used as the training set, 10% as the testing set, and 10% as the validation set. The source domain site data are input into AdaBiGRU after outlier detection, missing value filling and normalization, period segmentation by the TDC layer, and weight allocation by the temporal self-attention mechanism. In this paper, the lag time is set to 24 h and the dropout rate to 0.5, and the model is optimized using the Adam optimizer with a learning rate of 0.005, a batch size of 36, the ReLU activation function, and the MSE loss function. We use the root mean squared error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE) as three evaluation metrics to evaluate the prediction performance of AdaBiGRU. In the formulas for these three metrics, $n$ denotes the number of samples, $y_i$ denotes the observed value of the $i$-th sample, and $y_i^*$ denotes the predicted value of the $i$-th sample. The smaller the values of these three indicators, the higher the prediction accuracy and the better the model's performance.
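The metric formulas are not reproduced in this text; the sketch below uses the standard definitions of RMSE, MAE, and MAPE, with `y_true` playing the role of $y_i$ and `y_pred` the role of $y_i^*$. MAPE is returned in percent and assumes that no observed value is zero.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (observed values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

For observed values 100, 200, 400 and predictions 110, 180, 400, the absolute errors are 10, 20, and 0, so the MAE is 10, the RMSE is about 12.91, and the MAPE is about 6.67%.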

Comparison of pre-trained models
In order to test the performance of the AdaBiGRU model, this paper compares it with five prediction models, namely ARIMA, GRU, BiGRU, LightGBM, and Transformer, at four sites, namely Gucheng, Tiantan, Aotizhongxin, and Wanliu; the results are shown in Table 3. For PM10 concentration, the error values of both ARIMA and LightGBM are higher than those of GRU, BiGRU, Transformer, and AdaBiGRU, which suggests that time-series neural network models have higher prediction accuracy in atmospheric quality prediction. BiGRU predicts better than GRU. The Transformer performs better than GRU and BiGRU, indicating that the model based on the attention mechanism performs better than the traditional models. In addition, the proposed AdaBiGRU model has smaller error values than GRU, BiGRU, and Transformer, showing that AdaBiGRU is effective when applied to the problem of atmospheric pollutant concentration prediction.

TL-AdaBiGRU
In order to improve the prediction performance of the model at sites with limited data, this paper implements TL-AdaBiGRU by combining AdaBiGRU with model parameter transfer learning. The model is first trained on sufficient source domain data to determine the optimal model parameters; then, the last four layers of the model are frozen, and the model parameters are retained after training for a certain number of epochs. Finally, the frozen layers are unfrozen, and a new fully connected layer is added to fine-tune the source domain model using the target domain data, improving the prediction accuracy at the target site. The frozen layers of the model need to be identified before fine-tuning; freezing serves to preserve the knowledge learned by the pre-trained model on the source domain data and to prevent performance degradation due to over-tuning on the target domain data. The number of frozen layers directly affects the prediction performance of the model. If the number of frozen layers is too small, the model may not be able to learn enough "knowledge" from the source data. If the number of frozen layers is too large, the model will not be able to adjust enough parameters for the target data, which will affect the prediction. Therefore, selecting the appropriate number of frozen layers is a key issue for prediction performance.
The AdaBiGRU model was pre-trained using PM10 concentration data from the Aotizhongxin site. Eighty percent of the six months of samples collected at the Dongsi site from 2016/1 to 2016/7 were used to fine-tune the model with different numbers of frozen layers; 10 percent were used for testing and 10 percent for validation. The results presented in Table 4 show that the values of the three metrics decrease as the number of frozen layers increases, reaching a minimum when the number of frozen layers is 4. This is because when the number of frozen layers is too small, the model is affected by noise from other sites. As the number of frozen layers increases, the model is gradually less affected by noise from other sites, and the performance improves. When the number of frozen layers is more than 4, the error increases with the number of frozen layers, which is due to the overfitting of the model to the auxiliary sites. Therefore, this paper sets the number of frozen layers to 4. To verify the validity and reasonableness of this choice, we ran the same experiment for the pollutants PM2.5 and NO2 and found that the optimal number of frozen layers is also four. After that, the transfer model was tested using 20% of the data from the Dongsi site, and the comparison between the predicted and real values is shown in Fig. 9. Compared with the AdaBiGRU model, the fitting effect of the TL-AdaBiGRU model is significantly improved.

Discussions
The performance of the proposed methodological framework for atmospheric site prediction is presented in the previous sections. Its reliability and applicability still need to be further explored. This section focuses on the period segmentation of the time-similarity quantization algorithm, the validation of the model's effectiveness at other monitoring stations, and the prediction effectiveness of the proposed model for other pollutants.

Time similarity quantization period segmentation
As discussed in the section on temporal distribution characterization, air pollutant data are periodic and non-stationary, and the data distribution changes dynamically over time. In order to better characterize the distribution information in the air pollutant series, this paper adopts dynamic programming (DP) to solve the optimization problem of Eq. (1). First, the time series is uniformly partitioned into $N = 10$ parts, each of which is the smallest unit period and cannot be subdivided. Then, the value of $K$ is chosen randomly from the range $K \in \{2, 3, 4, 5, 6, 7, 8, 9, 10\}$. For a given value of $K$, a greedy strategy is used to choose the length $n_j$ of each period.
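The greedy step for a fixed $K$ can be sketched as follows. This is an illustrative simplification, not the authors' algorithm: candidate boundaries are the edges of the $N = 10$ unit parts, each round adds the boundary that maximizes the average pairwise distance between the resulting segments, and the distance is the assumed CORAL form $\|C_S - C_t\|_F^2 / (4q^2)$. The function names and the scoring rule are hypothetical.

```python
from itertools import combinations

import numpy as np


def coral_distance(s, t):
    """Assumed CORAL form: ||C_s - C_t||_F^2 / (4 q^2) for (n_samples, q) inputs."""
    q = s.shape[1]
    c_s = np.atleast_2d(np.cov(s, rowvar=False))
    c_t = np.atleast_2d(np.cov(t, rowvar=False))
    return float(np.sum((c_s - c_t) ** 2) / (4.0 * q * q))


def segment_score(data, cuts):
    """Average pairwise distance between the segments induced by the cut indices."""
    edges = [0] + sorted(cuts) + [len(data)]
    segments = [data[a:b] for a, b in zip(edges[:-1], edges[1:])]
    pairs = list(combinations(segments, 2))
    return sum(coral_distance(a, b) for a, b in pairs) / len(pairs)


def greedy_tdc(data, k, n_units=10):
    """Greedily pick k - 1 boundaries (k >= 2) among the unit-part edges."""
    unit = len(data) // n_units
    candidates = [i * unit for i in range(1, n_units)]
    cuts = []
    for _ in range(k - 1):
        best = max((c for c in candidates if c not in cuts),
                   key=lambda c: segment_score(data, cuts + [c]))
        cuts.append(best)
    return sorted(cuts)
```

On a series of 100 points whose spread changes abruptly after the 60th point, `greedy_tdc(series, k=2)` places the single boundary at index 60, so the two resulting periods have the most dissimilar distributions among the candidate splits.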

Validation of other monitoring sites
In order to verify the validity of the model proposed in this paper, we compared TL-AdaBiGRU with six models, namely ARIMA, GRU, BiGRU, LightGBM, Transformer, and AdaBiGRU, at the Huairou monitoring site. We selected the 6-month data from 2016/6 to 2016/12 at the Huairou monitoring station as the dataset and predicted the PM10 concentration from 0:00 on 2017/1/1 to 12:00 on 2017/1/3 (60 h in total). It can be seen from Fig. 11 that, with less data, the PM10 concentration predicted by the TL-AdaBiGRU model is closer to the actual value than that of the other models. The model effectively alleviates the problems of low prediction accuracy and weak generalization ability caused by the small amount of data. The model proposed in this paper is also very effective in multi-step prediction of the next 6, 12, 18, and 24 h, as shown in Fig. 12.

Predictive applications for other pollutants
The TL-AdaBiGRU model proposed in this article has achieved high accuracy in predicting PM10 concentration. In order to further verify the generalization of the model, we used the dataset from Huairou Station to predict [...]

[...] but also efficiently capture the time dependence of the time series. Then, based on the pre-trained model, a fine-tuning strategy is used to freeze the last few layers of the pre-trained model and fine-tune the remaining layers using the target domain data. The fine-tuned model can transfer the knowledge learned at the source site to the target site, thus improving the prediction accuracy. In this paper, experiments were conducted using air pollutant data from Beijing, and the main results are as follows:
• Quantifying temporal distribution characterization can be an excellent way to deal with air pollutant concentration data characterized by periodicity and dynamic changes in data distribution over time.
• The two-stage attention mechanism of the model can better analyze the nonlinear relationship between the air pollutant data, and in the PM10 concentration prediction experiments, the prediction results of the TL-AdaBiGRU proposed in this paper are better than those of AdaBiGRU, Transformer, BiGRU, GRU, and LightGBM.
• Transfer learning can effectively improve the performance of pollutant concentration prediction at data-shortage sites. Prediction experiments for other pollutants were also conducted at data-shortage sites with good results, verifying that the model generalizes well.
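The fine-tuning strategy described above (keep some pre-trained layers frozen and update the rest on target-domain data) can be sketched without any deep-learning framework. Everything in this sketch is hypothetical: the layer names, the toy weights and gradients, and the plain gradient-descent update are stand-ins, not the authors' implementation.

```python
def fine_tune_step(params, grads, frozen, lr=0.01):
    """Apply one gradient-descent step, leaving frozen layers untouched.

    params / grads: dicts mapping a layer name to a list of floats.
    frozen: set of layer names whose weights must not change.
    """
    updated = {}
    for name, weights in params.items():
        if name in frozen:
            # Frozen layer: weights are carried over from the source model.
            updated[name] = list(weights)
        else:
            # Trainable layer: move against the target-domain gradient.
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
    return updated


# Toy stand-in for a model pre-trained at the source site (assumed names).
pretrained = {"attention": [0.5, -0.2], "bigru": [0.1, 0.3], "dense": [0.7]}
# Toy gradients computed on target-site data (assumed values).
target_grads = {"attention": [1.0, 1.0], "bigru": [1.0, 1.0], "dense": [1.0]}
# Which layers to freeze is a design choice (assumed here to be the last one).
frozen_layers = {"dense"}

tuned = fine_tune_step(pretrained, target_grads, frozen_layers, lr=0.1)
print(tuned)
```

Passing the frozen set explicitly makes it easy to repeat the experiment with a different number of frozen layers, which is the comparison summarized in Table 4.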
The contribution of this study is the TL-AdaBiGRU model, which addresses two problems: newly built air quality monitoring stations have little historical data, and air pollutant time series are periodic with a distribution that changes dynamically over time. The prediction accuracy of the proposed model at newly built stations is significantly improved. Taking Beijing's air pollutant concentration data as an example, this paper shows that the model achieves higher accuracy.

The method proposed in this paper also has limitations. First, since the idea of transfer learning is to "learn from similar time series," the current method relies on the existence of similar sites to assist in learning the target. If no such similar site exists, transfer learning is not feasible. Second, this study only predicted pollutant concentration data for a few cities, and the migration analysis of the model was not comprehensive enough. Future work could apply the model to predict pollutant concentrations in multiple areas. In addition, the model can be applied to other time-series prediction tasks, such as stock price prediction, power load prediction, and traffic flow prediction. Third, although the method proposed in this paper improves the accuracy of pollutant prediction, its performance depends on the support of high-quality data, especially under different geographic conditions and infrastructures, and its applicability needs to be further improved in future studies. In future studies, we will try to combine knowledge from meta-transfer learning, domain adaptation, and domain generalization to improve the generalization and robustness of the model under different environments and infrastructures, and thus the overall performance of the prediction model.

Figure 5. The working process of transfer learning.

Input: dataset for the prediction process (including source and target domain data)
Output: predicted data for the target sites
S1: Perform outlier testing, missing value filling, and normalization of the data
S2: Select the source domain site according to Eq. (18)
S3: Quantize by time similarity according to Eq. (1) into different period segments
S4: Initialize epoch = 1 and Epochmax in AdaBiGRU
S5: While epoch ≤ Epochmax do
S6:   Assign appropriate weights to time instances according to Eqs. (3)-(5)
S7:   BiGRU learns the time-dependent features between data according to Eqs. (10)-(12)
S8:   Mine hidden-layer features with the multi-head external attention mechanism according to Eqs. (13)-(15)
S9:   Map the features to the fully connected layers
S10:   Update the parameters of the network layers through S6-S9
S11:   epoch ← epoch + 1
S12: end while
S13: Transfer the parameters of AdaBiGRU
S14: Process the target domain data according to S1
S15: Segment the target domain data into periods according to S3
S16: Input the target domain data into TL-AdaBiGRU, output the predicted values, and evaluate the model prediction performance according to Eqs. (22)-(24)
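Step S2 can be illustrated with a short sketch. Eq. (18) is not reproduced in this section, so the sketch assumes, based on the MMD values reported in Table 1, that the source site is the neighbouring site whose data has the smallest maximum mean discrepancy (MMD) to the target site. The RBF kernel, the `gamma` value, the site names, and the sample values are all assumptions for illustration.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel between two scalar observations."""
    return math.exp(-gamma * (x - y) ** 2)


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two 1-D samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy


def pick_source(target, candidates, gamma=1.0):
    """Return the candidate site name with the smallest MMD to the target."""
    return min(candidates, key=lambda name: mmd2(target, candidates[name], gamma))


# Toy normalized readings; site names and values are hypothetical.
target = [0.10, 0.20, 0.30]
neighbours = {
    "site_a": [0.12, 0.22, 0.28],  # distribution close to the target
    "site_b": [0.80, 0.90, 1.00],  # distribution far from the target
}

best = pick_source(target, neighbours)
print("selected source site:", best)
```

A smaller MMD means the two sites' data distributions are more alike, so knowledge learned at that source site is more likely to transfer.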

Figure 6. Distribution of the locations of the nine atmospheric monitoring stations in Beijing. Blue triangles represent stations with less historical data, and red triangles represent source domain stations with sufficient data. (This figure was drawn using Microsoft Visio software, version 16.0.10730.20102; the link to the software is http://officecdn.microsoft.com/pr/492350f6-3a01-4f97-b9c0-c7c6ddf67d60/media/zh-cn/VisioPro2019Retail.img).

Figure 7. A violin plot with box plots showing the distribution of PM10 data at each site, with a maximum value set and data exceeding the maximum value identified as outliers.

Figure 8. Spearman's correlation coefficient between pollutants and meteorological data. The Spearman correlation coefficient values range from −1 to 1. The larger the absolute value of the coefficient, the stronger the correlation between the two variables.
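As a reminder of how the coefficient in Fig. 8 is obtained, the following sketch computes Spearman's correlation as the Pearson correlation of the ranked values, with tied values sharing an average rank. The function names and the sample numbers are illustrative only and are not taken from the paper's data.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative values: a monotonic but nonlinear relationship.
pollutant = [1.0, 2.0, 3.0, 4.0]
weather = [1.0, 8.0, 27.0, 64.0]
print(spearman(pollutant, weather))
```

Because only the ranks matter, any monotonically increasing relationship yields a coefficient near 1 and any monotonically decreasing one yields a coefficient near −1.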
Use A and B to denote the start and end points of the time series, respectively. First, consider the case K = 2: a single segmentation point C is chosen from the boundaries of the N unit periods so that the distribution distance d(S_AC, S_CB) is maximized. Then, with C fixed, a second point D is chosen in the same way (K = 3) so that the total distribution distance between the resulting segments is maximized. In this way, the time series is divided into three parts: [A, C], [C, D], and [D, B]. Similarly, for K = 4, 5, 6, 7, 8, 9, 10, the same strategy is used to maximize the distribution distance. With the greedy strategy, the optimal splitting point can be selected so that the length of each period of the time series is more evenly distributed, thus obtaining better prediction model performance.

In order to verify the effectiveness of the proposed method, experiments were carried out at two sites, Changping and Shunyi, as shown in Fig. 10a. As K increases, the model performance first improves and then degrades: performance is best at K = 4 and K = 6 and gradually decreases as K increases further.

In order to verify the effectiveness of temporal distribution characterization (TDC) for segmenting atmospheric pollutant sequences, comparative experiments were carried out, as shown in Fig. 10b, where Split1 represents random partitioning, Split2 represents partitioning based on closest similarity, and Split3 represents partitioning quantified by temporal similarity. TDC divides the atmospheric pollutant sequence into the periods with the greatest distribution distance, and the RMSE is best when the sequence is partitioned into the least similar periods.
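The greedy search described above can be sketched as follows. This is a simplified illustration, not the authors' code: the distance `mean_gap` (absolute difference of segment means) is a stand-in for the distribution distance d, the score sums distances between adjacent segments only, and the function names and toy series are assumptions.

```python
from statistics import fmean


def mean_gap(a, b):
    """Stand-in distribution distance: absolute gap between segment means."""
    return abs(fmean(a) - fmean(b))


def greedy_split(series, k, n_units=10, distance=mean_gap):
    """Greedily choose k - 1 split points among the n_units - 1 unit boundaries.

    At each round the boundary that maximizes the summed distance between
    adjacent segments is added, while earlier choices stay fixed.
    """
    if not 2 <= k <= n_units:
        raise ValueError("k must lie between 2 and n_units")
    unit = len(series) // n_units
    candidates = [i * unit for i in range(1, n_units)]
    chosen = []
    while len(chosen) < k - 1:
        best_point, best_score = None, float("-inf")
        for point in candidates:
            if point in chosen:
                continue
            bounds = [0] + sorted(chosen + [point]) + [len(series)]
            segments = [series[s:e] for s, e in zip(bounds, bounds[1:])]
            score = sum(distance(x, y) for x, y in zip(segments, segments[1:]))
            if score > best_score:
                best_point, best_score = point, score
        chosen.append(best_point)
    return sorted(chosen)


# Toy series with three regimes (low, high, low); not real pollutant data.
toy = [0.0] * 30 + [10.0] * 30 + [0.0] * 40
print(greedy_split(toy, k=2))
print(greedy_split(toy, k=3))
```

On the toy series the K = 3 run recovers both regime changes, which mirrors the idea of splitting the sequence where the distribution differs most.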

Figure 11. Comparison of different models at the Huairou monitoring site. The red solid line is the real value, the blue dotted line represents TL-AdaBiGRU, the green dotted line represents AdaBiGRU, the pink dotted line represents BiGRU, the indigo dotted line represents GRU, the brown dotted line represents LightGBM, the purple dotted line represents ARIMA, and the yellow dotted line represents Transformer.

Figure 12. Comparison of multi-step prediction effects. (a) Prediction effects of the models at 6 and 12 h. (b) Prediction effects of the models at 18 and 24 h.

Figure 13. Predicted results of PM2.5, NO2, SO2, and O3 concentrations. The red part represents the real value, the blue represents the TL-AdaBiGRU model, the yellow represents the Transformer model, the green represents the BiGRU model, and the gray represents the LightGBM model.

Table 1. MMD values between target atmospheric monitoring sites and neighboring atmospheric monitoring sites.

Table 2. The descriptive data statistics of the target site and the source domain site.

Table 3. Comparison of effects of pre-trained models. Significant values are in bold.

Table 4. Impact of the number of frozen layers on the prediction accuracy of the model.

Figure 9. Comparison of actual and predicted values of PM2.5, PM10, and NO2 by the AdaBiGRU and TL-AdaBiGRU models.